Fast FBGEMM path KT.regroup_as #1910
Closed
This pull request was exported from Phabricator. Differential Revision: D56392296
dstaay-fb added a commit to dstaay-fb/torchrec that referenced this pull request on Apr 23, 2024.
Reviewed By: PaulZhang12
Labels: CLA Signed, fb-exported
Summary:
Use custom FBGEMM kernel when possible for inference/training. ~0-75% runtime speedup.
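For readers unfamiliar with the operation being sped up, here is a minimal plain-Python sketch of what a regroup is assumed to compute: each input holds per-key column blocks, and each output group concatenates the blocks of its keys in order. The function name `regroup_reference` and the list-of-rows data layout are hypothetical and chosen for illustration only; this is not the torchrec implementation or the FBGEMM kernel, just a reference for the result that a fast path and a fallback path would both be expected to produce.

```python
from typing import Dict, List, Sequence, Tuple

Row = List[float]


def regroup_reference(
    keys_per_tensor: Sequence[Sequence[str]],
    lengths_per_tensor: Sequence[Sequence[int]],
    values_per_tensor: Sequence[Sequence[Row]],
    groups: Sequence[Sequence[str]],
) -> List[List[Row]]:
    """Slice each key's columns out of its input, then concatenate per group."""
    # Map every key to (input index, first column, one-past-last column).
    spans: Dict[str, Tuple[int, int, int]] = {}
    for t, (keys, lengths) in enumerate(zip(keys_per_tensor, lengths_per_tensor)):
        start = 0
        for key, length in zip(keys, lengths):
            spans[key] = (t, start, start + length)
            start += length

    batch_size = len(values_per_tensor[0])
    regrouped: List[List[Row]] = []
    for group in groups:
        rows: List[Row] = []
        for b in range(batch_size):
            row: Row = []
            for key in group:
                t, lo, hi = spans[key]
                row.extend(values_per_tensor[t][b][lo:hi])
            rows.append(row)
        regrouped.append(rows)
    return regrouped


# Two inputs with batch size 2: the first holds f1 (2 columns) and f2 (1 column),
# the second holds f3 (2 columns).
example = regroup_reference(
    keys_per_tensor=[["f1", "f2"], ["f3"]],
    lengths_per_tensor=[[2, 1], [2]],
    values_per_tensor=[
        [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
        [[7.0, 8.0], [9.0, 10.0]],
    ],
    groups=[["f1", "f3"], ["f2"]],
)
```

In this example the first output group holds the columns of f1 followed by those of f3, and the second holds only f2.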
Benchmark Results [Forward]
[fallback] _regroup_keyed_tenors | B: 512 | F: 80 | device: cuda | Runtime (P90): 0.4 ms | Memory (P90): 24.0
[prod] KeyedTensor.regroup | B: 512 | F: 80 | device: cuda | Runtime (P90): 0.4 ms | Memory (P90): 36.0
[fallback] _regroup_keyed_tenors | B: 512 | F: 160 | device: cuda | Runtime (P90): 0.8 ms | Memory (P90): 48.0
[prod] KeyedTensor.regroup | B: 512 | F: 160 | device: cuda | Runtime (P90): 0.6 ms | Memory (P90): 72.0
[fallback] _regroup_keyed_tenors | B: 512 | F: 320 | device: cuda | Runtime (P90): 1.9 ms | Memory (P90): 96.0
[prod] KeyedTensor.regroup | B: 512 | F: 320 | device: cuda | Runtime (P90): 0.7 ms | Memory (P90): 144.0
[fallback] _regroup_keyed_tenors | B: 512 | F: 640 | device: cuda | Runtime (P90): 4.6 ms | Memory (P90): 192.0
[prod] KeyedTensor.regroup | B: 512 | F: 640 | device: cuda | Runtime (P90): 1.3 ms | Memory (P90): 288.0
[fallback] _regroup_keyed_tenors | B: 512 | F: 1280 | device: cuda | Runtime (P90): 13.2 ms | Memory (P90): 384.0
[prod] KeyedTensor.regroup | B: 512 | F: 1280 | device: cuda | Runtime (P90): 2.2 ms | Memory (P90): 576.0
[fallback] _regroup_keyed_tenors | B: 1024 | F: 80 | device: cuda | Runtime (P90): 0.3 ms | Memory (P90): 48.0
[prod] KeyedTensor.regroup | B: 1024 | F: 80 | device: cuda | Runtime (P90): 0.4 ms | Memory (P90): 72.0
[fallback] _regroup_keyed_tenors | B: 1024 | F: 160 | device: cuda | Runtime (P90): 0.8 ms | Memory (P90): 96.0
[prod] KeyedTensor.regroup | B: 1024 | F: 160 | device: cuda | Runtime (P90): 0.6 ms | Memory (P90): 144.0
[fallback] _regroup_keyed_tenors | B: 1024 | F: 320 | device: cuda | Runtime (P90): 1.8 ms | Memory (P90): 192.0
[prod] KeyedTensor.regroup | B: 1024 | F: 320 | device: cuda | Runtime (P90): 0.9 ms | Memory (P90): 288.0
[fallback] _regroup_keyed_tenors | B: 1024 | F: 640 | device: cuda | Runtime (P90): 4.1 ms | Memory (P90): 384.0
[prod] KeyedTensor.regroup | B: 1024 | F: 640 | device: cuda | Runtime (P90): 1.6 ms | Memory (P90): 576.0
[fallback] _regroup_keyed_tenors | B: 1024 | F: 1280 | device: cuda | Runtime (P90): 12.8 ms | Memory (P90): 768.0
[prod] KeyedTensor.regroup | B: 1024 | F: 1280 | device: cuda | Runtime (P90): 3.1 ms | Memory (P90): 1152.0
[fallback] _regroup_keyed_tenors | B: 2048 | F: 80 | device: cuda | Runtime (P90): 0.4 ms | Memory (P90): 96.0
[prod] KeyedTensor.regroup | B: 2048 | F: 80 | device: cuda | Runtime (P90): 0.5 ms | Memory (P90): 144.0
[fallback] _regroup_keyed_tenors | B: 2048 | F: 160 | device: cuda | Runtime (P90): 0.7 ms | Memory (P90): 192.0
[prod] KeyedTensor.regroup | B: 2048 | F: 160 | device: cuda | Runtime (P90): 0.8 ms | Memory (P90): 288.0
[fallback] _regroup_keyed_tenors | B: 2048 | F: 320 | device: cuda | Runtime (P90): 1.6 ms | Memory (P90): 384.0
[prod] KeyedTensor.regroup | B: 2048 | F: 320 | device: cuda | Runtime (P90): 1.4 ms | Memory (P90): 576.0
[fallback] _regroup_keyed_tenors | B: 2048 | F: 640 | device: cuda | Runtime (P90): 4.8 ms | Memory (P90): 768.0
[prod] KeyedTensor.regroup | B: 2048 | F: 640 | device: cuda | Runtime (P90): 2.8 ms | Memory (P90): 1152.0
[fallback] _regroup_keyed_tenors | B: 2048 | F: 1280 | device: cuda | Runtime (P90): 12.5 ms | Memory (P90): 1536.0
[prod] KeyedTensor.regroup | B: 2048 | F: 1280 | device: cuda | Runtime (P90): 5.6 ms | Memory (P90): 2304.0
[fallback] _regroup_keyed_tenors | B: 4096 | F: 80 | device: cuda | Runtime (P90): 0.4 ms | Memory (P90): 192.0
[prod] KeyedTensor.regroup | B: 4096 | F: 80 | device: cuda | Runtime (P90): 0.8 ms | Memory (P90): 288.0
[fallback] _regroup_keyed_tenors | B: 4096 | F: 160 | device: cuda | Runtime (P90): 0.9 ms | Memory (P90): 384.0
[prod] KeyedTensor.regroup | B: 4096 | F: 160 | device: cuda | Runtime (P90): 1.4 ms | Memory (P90): 576.0
[fallback] _regroup_keyed_tenors | B: 4096 | F: 320 | device: cuda | Runtime (P90): 1.7 ms | Memory (P90): 768.0
[prod] KeyedTensor.regroup | B: 4096 | F: 320 | device: cuda | Runtime (P90): 2.8 ms | Memory (P90): 1152.0
[fallback] _regroup_keyed_tenors | B: 4096 | F: 640 | device: cuda | Runtime (P90): 4.1 ms | Memory (P90): 1536.0
[prod] KeyedTensor.regroup | B: 4096 | F: 640 | device: cuda | Runtime (P90): 5.6 ms | Memory (P90): 2304.0
[fallback] _regroup_keyed_tenors | B: 4096 | F: 1280 | device: cuda | Runtime (P90): 12.2 ms | Memory (P90): 3072.0
[prod] KeyedTensor.regroup | B: 4096 | F: 1280 | device: cuda | Runtime (P90): 11.1 ms | Memory (P90): 4608.0
Benchmark Results [Forward + Backward]
[prod] KeyedTensor.regroup | B: 512 | F: 80 | device: cuda | Runtime (P90): 2.2 ms | Memory (P90): 72.0
[fallback] _regroup_keyed_tenors | B: 512 | F: 160 | device: cuda | Runtime (P90): 4.7 ms | Memory (P90): 144.0
[prod] KeyedTensor.regroup | B: 512 | F: 160 | device: cuda | Runtime (P90): 3.4 ms | Memory (P90): 144.0
[fallback] _regroup_keyed_tenors | B: 512 | F: 320 | device: cuda | Runtime (P90): 9.0 ms | Memory (P90): 288.0
[prod] KeyedTensor.regroup | B: 512 | F: 320 | device: cuda | Runtime (P90): 6.5 ms | Memory (P90): 288.0
[fallback] _regroup_keyed_tenors | B: 512 | F: 640 | device: cuda | Runtime (P90): 19.9 ms | Memory (P90): 576.0
[prod] KeyedTensor.regroup | B: 512 | F: 640 | device: cuda | Runtime (P90): 11.4 ms | Memory (P90): 576.0
[fallback] _regroup_keyed_tenors | B: 512 | F: 1280 | device: cuda | Runtime (P90): 46.7 ms | Memory (P90): 1152.0
[prod] KeyedTensor.regroup | B: 512 | F: 1280 | device: cuda | Runtime (P90): 23.1 ms | Memory (P90): 1152.0
[fallback] _regroup_keyed_tenors | B: 1024 | F: 80 | device: cuda | Runtime (P90): 2.6 ms | Memory (P90): 144.0
[prod] KeyedTensor.regroup | B: 1024 | F: 80 | device: cuda | Runtime (P90): 2.5 ms | Memory (P90): 144.0
[fallback] _regroup_keyed_tenors | B: 1024 | F: 160 | device: cuda | Runtime (P90): 4.5 ms | Memory (P90): 288.0
[prod] KeyedTensor.regroup | B: 1024 | F: 160 | device: cuda | Runtime (P90): 3.9 ms | Memory (P90): 288.0
[fallback] _regroup_keyed_tenors | B: 1024 | F: 320 | device: cuda | Runtime (P90): 8.8 ms | Memory (P90): 576.0
[prod] KeyedTensor.regroup | B: 1024 | F: 320 | device: cuda | Runtime (P90): 6.7 ms | Memory (P90): 576.0
[fallback] _regroup_keyed_tenors | B: 1024 | F: 640 | device: cuda | Runtime (P90): 18.7 ms | Memory (P90): 1152.0
[prod] KeyedTensor.regroup | B: 1024 | F: 640 | device: cuda | Runtime (P90): 12.2 ms | Memory (P90): 1152.0
[fallback] _regroup_keyed_tenors | B: 1024 | F: 1280 | device: cuda | Runtime (P90): 42.8 ms | Memory (P90): 2304.0
[prod] KeyedTensor.regroup | B: 1024 | F: 1280 | device: cuda | Runtime (P90): 23.1 ms | Memory (P90): 2304.0
[fallback] _regroup_keyed_tenors | B: 2048 | F: 80 | device: cuda | Runtime (P90): 2.5 ms | Memory (P90): 288.0
[prod] KeyedTensor.regroup | B: 2048 | F: 80 | device: cuda | Runtime (P90): 2.4 ms | Memory (P90): 288.0
[fallback] _regroup_keyed_tenors | B: 2048 | F: 160 | device: cuda | Runtime (P90): 4.5 ms | Memory (P90): 576.0
[prod] KeyedTensor.regroup | B: 2048 | F: 160 | device: cuda | Runtime (P90): 4.2 ms | Memory (P90): 576.0
[fallback] _regroup_keyed_tenors | B: 2048 | F: 320 | device: cuda | Runtime (P90): 8.9 ms | Memory (P90): 1152.0
[prod] KeyedTensor.regroup | B: 2048 | F: 320 | device: cuda | Runtime (P90): 7.7 ms | Memory (P90): 1152.0
[fallback] _regroup_keyed_tenors | B: 2048 | F: 640 | device: cuda | Runtime (P90): 19.2 ms | Memory (P90): 2304.0
[prod] KeyedTensor.regroup | B: 2048 | F: 640 | device: cuda | Runtime (P90): 12.9 ms | Memory (P90): 2304.0
[fallback] _regroup_keyed_tenors | B: 2048 | F: 1280 | device: cuda | Runtime (P90): 45.1 ms | Memory (P90): 4608.0
[prod] KeyedTensor.regroup | B: 2048 | F: 1280 | device: cuda | Runtime (P90): 26.4 ms | Memory (P90): 4608.0
[fallback] _regroup_keyed_tenors | B: 4096 | F: 80 | device: cuda | Runtime (P90): 2.4 ms | Memory (P90): 576.0
[prod] KeyedTensor.regroup | B: 4096 | F: 80 | device: cuda | Runtime (P90): 2.7 ms | Memory (P90): 576.0
[fallback] _regroup_keyed_tenors | B: 4096 | F: 160 | device: cuda | Runtime (P90): 4.4 ms | Memory (P90): 1152.0
[prod] KeyedTensor.regroup | B: 4096 | F: 160 | device: cuda | Runtime (P90): 4.4 ms | Memory (P90): 1152.0
[fallback] _regroup_keyed_tenors | B: 4096 | F: 320 | device: cuda | Runtime (P90): 8.4 ms | Memory (P90): 2304.0
[prod] KeyedTensor.regroup | B: 4096 | F: 320 | device: cuda | Runtime (P90): 8.1 ms | Memory (P90): 2304.0
[fallback] _regroup_keyed_tenors | B: 4096 | F: 640 | device: cuda | Runtime (P90): 28.0 ms | Memory (P90): 4608.0
[prod] KeyedTensor.regroup | B: 4096 | F: 640 | device: cuda | Runtime (P90): 15.6 ms | Memory (P90): 4608.0
[fallback] _regroup_keyed_tenors | B: 4096 | F: 1280 | device: cuda | Runtime (P90): 43.2 ms | Memory (P90): 9216.0
[prod] KeyedTensor.regroup | B: 4096 | F: 1280 | device: cuda | Runtime (P90): 31.2 ms | Memory (P90): 9216.0
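To read the rows above as speedups, one option is to pair each [fallback] row with the [prod] row for the same B and F and compute the relative runtime reduction. The Python sketch below is not part of this change: the helper names `parse_rows` and `runtime_reduction` are hypothetical, and the regular expression assumes the exact row layout printed in these logs. A configuration that lacks either row is skipped rather than guessed.

```python
import re

# Assumed row layout, copied from the benchmark logs:
# [impl] name | B: <int> | F: <int> | device: <str> | Runtime (P90): <float> ms | Memory (P90): <float>
ROW = re.compile(
    r"\[(?P<impl>\w+)\]\s+(?P<name>\S+)\s+\|\s+B:\s*(?P<B>\d+)\s+\|\s+F:\s*(?P<F>\d+)"
    r"\s+\|\s+device:\s*(?P<device>\w+)\s+\|\s+Runtime \(P90\):\s*(?P<ms>[\d.]+) ms"
    r"\s+\|\s+Memory \(P90\):\s*(?P<mem>[\d.]+)"
)


def parse_rows(text):
    """Map (B, F) -> {impl: P90 runtime in ms} for every row found in text."""
    rows = {}
    for m in ROW.finditer(text):
        key = (int(m["B"]), int(m["F"]))
        rows.setdefault(key, {})[m["impl"]] = float(m["ms"])
    return rows


def runtime_reduction(rows):
    """Return (B, F) -> 1 - prod/fallback; negative means prod was slower."""
    out = {}
    for key, impls in rows.items():
        # Skip configurations that are missing one of the two rows.
        if "fallback" in impls and "prod" in impls:
            out[key] = 1.0 - impls["prod"] / impls["fallback"]
    return out


# Two rows copied verbatim from the forward + backward results.
sample = """
[fallback] _regroup_keyed_tenors | B: 512 | F: 1280 | device: cuda | Runtime (P90): 46.7 ms | Memory (P90): 1152.0
[prod] KeyedTensor.regroup | B: 512 | F: 1280 | device: cuda | Runtime (P90): 23.1 ms | Memory (P90): 1152.0
"""

for (b, f), r in runtime_reduction(parse_rows(sample)).items():
    print(f"B={b} F={f}: runtime reduction {r:.1%}")
```

Feeding the full log text to `parse_rows` instead of `sample` gives one entry per (B, F) pair. The forward and forward + backward sections should be parsed separately, since both use the same keys.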
Differential Revision: D56392296